Poster: dmoynihan Date: Feb 25, 2005 6:25am
Forum: toronto Subject: "processed images?"

I just finished proofed HTML versions of The Queen's Necklace and Taking the Bastille (that's just a revised title using your graphics; it seems identical), which the Archive is of course welcome to. However, I noticed that if you run OCR on the "Processed image," the text quality suffers greatly compared with an OCR of the "Unprocessed image."

Any ideas what's going on here? Lots of weird italics, missing letters, breaks, etc. that don't show up in the unprocessed version. It's a little faster to work with the processed images, which is why I tried it for 60 pages, but maybe I should go with a faster computer or something.


Poster: molly Date: Mar 3, 2005 7:28am
Forum: toronto Subject: Re: 'processed images?'

What package are you using to do your OCR? Our unproofed OCR comes from making the DJVU derivatives, and admittedly isn't the best around. We do like that it outputs a piece of XML that records the location of the text on each page with bounding boxes.
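As a rough illustration of what that bounding-box XML makes possible, here is a minimal sketch that pulls each word and its box out of a page. The sample markup is hypothetical: it assumes DjVu-style `WORD` elements carrying a comma-separated `coords` attribute, and the real derivative files may differ in element names and coordinate order.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names and coords layout are assumptions.
SAMPLE_XML = """<OBJECT>
  <HIDDENTEXT>
    <LINE>
      <WORD coords="100,220,180,200">The</WORD>
      <WORD coords="190,220,300,200">Queen's</WORD>
      <WORD coords="310,220,450,200">Necklace</WORD>
    </LINE>
  </HIDDENTEXT>
</OBJECT>"""


def extract_words(xml_text):
    """Return a list of (text, coords) pairs, one per WORD element."""
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("WORD"):
        # Split the coords attribute into integers, skipping empty pieces.
        raw = word.get("coords", "")
        coords = tuple(int(v) for v in raw.split(",") if v)
        words.append((word.text or "", coords))
    return words


words = extract_words(SAMPLE_XML)
for text, coords in words:
    print(text, coords)
```

With boxes attached to each word, a search hit can be highlighted on the page image rather than only in the plain text.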

I'm curious if you are getting better results because you are just using a better OCR package, or because you are using the higher resolution image.

HP Labs is kindly working on getting us automatically indexed, searchable PDFs, and they are using ABBYY FineReader. Our OCR should improve when those start to stream into the collection.

At some point, we'd like to create a module to take volunteers' hand-corrected OCR and work it back into the XML that DJVU puts out. But all of these tools will come in good time!
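One way such a module might work, sketched under the same assumptions about the XML (hypothetical `WORD` elements with a `coords` attribute): align the OCR words against the proofed text and swap in the corrected word wherever the alignment is one-to-one, leaving the bounding boxes alone. The function name `merge_corrections` is made up for this example, and anything that does not line up cleanly is only reported, not guessed at.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical OCR output with two typical misreads.
SAMPLE_OCR_XML = """<OBJECT>
  <LINE>
    <WORD coords="100,220,180,200">Tbe</WORD>
    <WORD coords="190,220,300,200">Queen's</WORD>
    <WORD coords="310,220,450,200">Neck1ace</WORD>
  </LINE>
</OBJECT>"""


def merge_corrections(xml_text, corrected_text):
    """Copy proofed words into WORD elements, keeping their coords."""
    root = ET.fromstring(xml_text)
    word_elems = list(root.iter("WORD"))
    ocr_words = [w.text or "" for w in word_elems]
    fixed_words = corrected_text.split()

    matcher = difflib.SequenceMatcher(a=ocr_words, b=fixed_words, autojunk=False)
    unresolved = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: replace one-to-one.
            for elem, new in zip(word_elems[i1:i2], fixed_words[j1:j2]):
                elem.text = new
        elif tag != "equal":
            # Splits, joins, insertions, deletions need a human decision.
            unresolved.append((tag, ocr_words[i1:i2], fixed_words[j1:j2]))
    return ET.tostring(root, encoding="unicode"), unresolved


merged_xml, unresolved = merge_corrections(SAMPLE_OCR_XML, "The Queen's Necklace")
print(merged_xml)
print(unresolved)
```

Words that the OCR split or joined change the word count, so they land in `unresolved`; a real tool would need a policy for dividing or merging boxes in those cases.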

Thanks for doing such great work!



Poster: Greg Lindahl Date: Apr 16, 2005 2:36pm
Forum: toronto Subject: Re: 'processed images?'

Hopefully you'll use the FineReader OCR in your DJVU files, too -- the PG Distributed Proofreaders community has experience with FineReader, and it really rocks. Having only PDFs with the good OCR would be a good start, though. Imagine clustering books using the raw OCR...